Statistical learning: tree-based methods

MACS 30100
University of Chicago

February 27, 2017

Decision trees

  • Intuitive
  • Regression and classification
  • Split observations into regions, and base predictions on the mean or mode of the response in each region

Single predictor

Stratification

  1. Divide the predictor space (\(X_1, X_2, \dots, X_p\)) into \(J\) distinct and non-overlapping regions \(R_1, R_2, \dots, R_J\).
  2. For every observation that falls in region \(R_j\), we make the same prediction: the mean of the response variable \(Y\) for the training observations in \(R_j\).
  • Iterative process

Multiple predictors

Additional nodes

Estimation procedure

  • Minimize the residual sum of squares (RSS)

    \[\sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\]

  • Recursive binary splitting
    • Top-down
    • Greedy
  • Continue until stopping point reached
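
The greedy split search at the heart of this procedure can be sketched in code. The slides' examples use R; as an illustrative stand-in, here is the search for a single predictor in Python (the function name and toy data are hypothetical): scan every candidate cutpoint and keep the one minimizing the RSS above.

```python
import numpy as np

def best_split(x, y):
    """Greedy search over one predictor: try every cutpoint s and keep
    the one minimizing RSS = sum_L (y - ybar_L)^2 + sum_R (y - ybar_R)^2."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_s, best_rss = None, np.inf
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            # Place the cutpoint midway between adjacent x values
            best_s, best_rss = (x[i - 1] + x[i]) / 2, rss
    return best_s, best_rss
```

A full tree repeats this search over every predictor in every current region (top-down), splits at the best cutpoint, and recurses until the stopping rule is reached; the strategy is greedy because each split is chosen without looking ahead.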

Pruning the tree

  • Balance accuracy and complexity
  • Cost complexity pruning
  • Tuning parameter
  • Use with \(k\)-fold cross-validation
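
These steps can be illustrated with scikit-learn, which implements cost complexity pruning through the `ccp_alpha` tuning parameter (a Python stand-in for the R workflow in the slides; the dataset here is synthetic): generate candidate values of the tuning parameter from the pruning path, then choose among them with \(k\)-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Candidate tuning-parameter values from the pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# k-fold cross-validation over alpha; keep the value with the best mean accuracy
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
```

Larger values of the tuning parameter trade accuracy on the training data for a smaller, simpler tree; cross-validation picks the point where that trade-off generalizes best.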

Pruning Auto tree

Optimal Auto tree

Classification trees

  • Class prediction
  • Class proportion
  • Classification error rate

    \[E = 1 - \max_{k}(\hat{p}_{mk})\]

  • Gini index

    \[G = \sum_{k = 1}^K \hat{p}_{mk} (1 - \hat{p}_{mk})\]

  • Cross-entropy

    \[D = - \sum_{k = 1}^K \hat{p}_{mk} \log(\hat{p}_{mk})\]
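
The three measures above are simple functions of the class proportions \(\hat{p}_{mk}\) in a node. A minimal sketch (illustrative Python; the function name is hypothetical):

```python
import numpy as np

def node_impurity(p):
    """Classification error rate, Gini index, and cross-entropy for a node
    with class proportions p = (p_m1, ..., p_mK)."""
    p = np.asarray(p, dtype=float)
    error = 1 - p.max()                 # E = 1 - max_k p_mk
    gini = np.sum(p * (1 - p))          # G = sum_k p_mk (1 - p_mk)
    nz = p[p > 0]                       # drop zero proportions: 0 log 0 = 0
    entropy = -np.sum(nz * np.log(nz))  # D = -sum_k p_mk log p_mk
    return error, gini, entropy
```

All three are zero for a pure node and maximized when the classes are evenly mixed; Gini and cross-entropy are more sensitive to node purity than the error rate, which is why they are preferred for growing the tree.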

Titanic tree

Optimal Titanic tree

Increasing node purity

Males older than or equal to 13 on the Titanic
Outcome    Number of training observations
Died                                   344
Survived                                72

Increasing node purity

Less than 24.75 years old   Outcome    Number of training observations
FALSE                       Died                                   232
FALSE                       Survived                                60
TRUE                        Died                                   112
TRUE                        Survived                                12

Trees vs. regression

  • Linear functional form

    \[f(X) = \beta_0 + \sum_{j = 1}^p X_j \beta_j\]

  • Decision tree functional form

    \[f(X) = \sum_{m = 1}^M c_m \cdot 1_{X \in R_m}\]

Benefits/drawbacks to decision trees

  • Easy to explain
  • Visualizing
  • Qualitative predictors are easily handled
  • Accuracy rates are generally lower than other regression/classification methods
  • Trees can be non-robust

Non-robust tree

Bagging

  • Trees -> high variance and low bias
  • Want low variance and low bias
  • Bootstrap aggregating

    \[\hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b = 1}^B \hat{f}^b(x)\]

    \[\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b = 1}^B \hat{f}^{*b}(x)\]
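
The bagging formula translates almost line-for-line into code: fit a deep (unpruned, high-variance) tree \(\hat{f}^{*b}\) on each of \(B\) bootstrap samples and average their predictions. An illustrative Python sketch (the slides' examples use R; the function name and toy data are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_predict(X, y, X_new, B=100, seed=0):
    """Bootstrap aggregation: fit a deep tree f_hat^{*b} on each of B
    bootstrap samples, then average the B predictions at each new point."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(X_new)))
    for b in range(B):
        idx = rng.integers(0, n, n)                 # sample n rows with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = tree.predict(X_new)
    return preds.mean(axis=0)                       # f_hat_bag(x)
```

Averaging leaves the low bias of the individual trees roughly intact while shrinking the variance, which is the whole point of the method.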

Out-of-bag estimates

## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_rf_data,      mtry = 7, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 20.6%
## Confusion matrix:
##          Died Survived class.error
## Died      357       67       0.158
## Survived   80      210       0.276
Estimation time for OOB error rate
##    user  system elapsed 
##   0.291   0.005   0.297
Estimation time for \(10\)-fold CV error rate
##    user  system elapsed 
##   3.570   0.133   3.759

Variable importance measures

Random forests

  • Correlated trees
  • Random forests
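
The only difference between bagging and a random forest is the number of predictors considered at each split (`mtry` in R's randomForest, `max_features` in scikit-learn). As an illustrative sketch on synthetic data (a Python stand-in for the R comparison in the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging = a "random forest" that considers all p predictors at each split,
# so a dominant predictor makes the trees highly correlated.
bag = RandomForestClassifier(n_estimators=200, max_features=None,
                             oob_score=True, random_state=0).fit(X, y)

# A random forest considers only a random subset (here sqrt(p)) at each split,
# decorrelating the trees before their predictions are averaged.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_, rf.oob_score_)   # out-of-bag accuracy for each ensemble
```

Averaging decorrelated trees reduces variance more than averaging correlated ones, which is why the random forest often beats the bagged ensemble (as in the Titanic comparison that follows).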

Bagged vs. random forest

## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_rf_data,      mtry = 7, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 20.6%
## Confusion matrix:
##          Died Survived class.error
## Died      357       67       0.158
## Survived   80      210       0.276
## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_rf_data,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 18.4%
## Confusion matrix:
##          Died Survived class.error
## Died      381       43       0.101
## Survived   88      202       0.303

Bagged vs. random forest

Variable used to generate the first split in each tree (bagged trees)
Variable used to split   Number of trees
Female                               500

Variable used to generate the first split in each tree (random forest)
Variable used to split   Number of trees
Age                                   39
Embarked                              61
Fare                                 113
Female                               120
Parch                                 31
Pclass                               127
SibSp                                  9

Bagged vs. random forest

Boosting

  • Sequential growth
  • Predict residuals, not response variable \(Y\)
  • Updating over time
  • Learning slowly

Boosting parameters

  1. The number of trees \(B\)
  2. The shrinkage parameter \(\lambda\)
  3. The number of splits \(d\) in each tree
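
The three parameters above map directly onto a minimal boosting sketch for regression (illustrative Python; the slides' examples use R's gbm, and the function names here are hypothetical): \(B\) sequential trees of depth \(d\), each fit to the current residuals rather than \(Y\), with each tree's contribution shrunk by \(\lambda\).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, B=200, lam=0.1, d=1):
    """Boosting for regression: B sequential depth-d trees, each fit to the
    current residuals r; predictions are updated by lam * f_hat^b(x)."""
    f = np.zeros(len(y))          # current ensemble prediction
    r = y.astype(float).copy()    # current residuals
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=d).fit(X, r)  # predict residuals
        update = tree.predict(X)
        f += lam * update
        r -= lam * update         # learn slowly: shrink each tree's contribution
        trees.append(tree)
    return trees, f

def boost_predict(trees, X_new, lam=0.1):
    return lam * sum(t.predict(X_new) for t in trees)
```

Smaller \(\lambda\) slows the learning, which generally requires a larger \(B\); the depth \(d\) controls the interaction order each tree can capture.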

Method comparison

## Distribution not specified, assuming bernoulli ...
## Distribution not specified, assuming bernoulli ...
## Distribution not specified, assuming bernoulli ...

Optimal iterations for boosting

## Using OOB method...
## Using OOB method...
## Using OOB method...
Optimal number of boosting iterations
Depth   Optimal number of iterations
1                               3449
2                               3210
4                               2360